Experimenter Expectancy, Covert Communication, and Meta-Analytic Methods

Robert Rosenthal 
Exactly thirty years ago I presented my first paper at APA. Both
presentations, that one in Cincinnati in 1959 and this one in New Orleans in 1989, 
would have been quite impossible without Donald Campbell. 

Donald Campbell is not only a brilliant scholar of the social and behavioral 
sciences, he is an inspired and inspiring teacher as well; one who has affected the 
intellectual lives of scientists and scholars of all kinds. His impact on me was 
enormous. In addition to his intellectual inspiration, he provided me with great 
personal support, thirty years ago when I was engaged in very controversial research 
on the unintended effects of psychological researchers on the results of their 
research. At that time he was one of the very few established psychologists to speak 
out on behalf of a "backwoods psychologist" conducting research at the University of
North Dakota. Conducting research, it should be noted, that remained successfully
unpublished for years. (Two others of my psychological sponsors at that time were
Harold Pepinsky--who, fittingly, had developed the very concept of psychological
sponsor--and Hank Riecken, who had anticipated so much of the work on the social
psychology of the psychological experiment, and who was responsible for the
financial support of the National Science Foundation for the work I was doing in
those early days.)

My first communication from Don Campbell was in a letter he wrote on 
December 1, 1958, in which he agreed to contribute to a symposium on the problem
of experimenter bias at the forthcoming APA. A long correspondence followed in 
which he gave invaluable advice on organizing the symposium and later, on 
publishing a book on the topic. Recently re-reading this correspondence showed me 
just how good a mentor Don Campbell was, even by mail. 

I take pride, in primitive identification with Don, that we both spent time
studying at UC Berkeley (he a lot and I a little); that we both taught at Ohio State
(he a lot and I a little); and that we both published in unrefereed journals (he a little 
and I a lot). 

It seems consistent with the Campbellian spirit for me to discuss today some 
matters that are substantive and some matters that are methodological. We begin 
with the substantive. Those who know me best will be surprised: I am not 
presenting the results of our most recent studies of covert communication in 
classrooms, clinics, courtrooms, or laboratories. Instead, I want to propose a compact
"theory" of the mediation of teacher expectation effects. I will describe the theory,
and in doing so suggest a research agenda for its investigation. We will have a brief
look at the nature of the theory, consider some structural and dynamic features, and
the role of (a) various channels of communication, (b) molar versus molecular
variables, (c) redundancy versus specificity, (d) channel discrepancy, and (e)
interactional synchrony. Finally we consider direct interventions to test the theory
and some future directions.
The Affect/Effort Theory of the Mediation of Teacher Expectation Effects:
A Research Agenda

The affect/effort theory states that a change in the level of expectations held by
a teacher for the intellectual performance of a student is translated into (a) a change
in the affect shown by the teacher toward that student and, relatively independently,
(b) a change in the degree of effort exerted by the teacher in the teaching of that
student. Specifically, the more favorable the change in the level of expectation held
by the teacher for a particular student, the more positive the affect shown toward
that student and the greater the effort expended on behalf of that student. The
increase in positive affect is presumed to be a reflection of increased liking for the
student for any of several plausible reasons (Jussim, 1986). The increase in teaching
effort is presumed to be a reflection of an increased belief on the part of the teacher
that the student is capable of learning so that the effort is worth it (Rosenthal &
Jacobson, 1968; Swann & Snyder, 1980).
Structural Features 

The affect/effort theory is consistent with the theoretical writings of most of the
workers in this area of research (e.g., Brophy, the Coopers (Harris and Joel), Darley,
Deaux, Dusek, Fazio, Good, Jones, Jussim, Miller, Snyder, Swann, Turnbull, Zanna,
and others), most of whom would probably find it congenial. In addition, the
conceptual distinction between the affect and effort factors maps nicely onto the
affect/cognition distinction recently under fruitful debate by Lazarus (1984) and
Zajonc (1984). The neuroanatomic evidence, in particular, gives a strong Bayesian
prior probability to the likelihood of the importance and relative independence of the
affect and effort factors.

The affect/effort theory is also consistent with (but not directly demonstrated
by) the results of a recent set of 31 meta-analyses investigating the older four-factor
"theory" of the mediation of interpersonal expectancy effects (Harris & Rosenthal,
1985). Although our meta-analytic work has given strong support to each of the four
"factors" of climate, input, feedback, and response opportunity, there are virtually no
data available to permit us to conclude that these four "factors" are, in fact,
relatively orthogonal. We plan to do principal components analyses of a large set of
variables serving to define the four factors. The prediction is that most of the dozens
of variables involved will turn out to load substantially either on the affect (roughly
climate) or the effort (roughly input) component, after varimax rotation. Our
prediction is not that only two "significant" components will emerge, but rather that
our two components of affect and effort will dominate over other emerging
components.

Dynamic Features

The emergence of two relatively orthogonal and relatively important (in the
sense of the sum of the squared factor loadings) components of affect and effort
provides necessary but not sufficient evidence for the theory. It is also necessary to
show that the magnitude of teacher expectation effects depends upon a differential
increase in positive affect and teaching effort directed toward those students for 
whom more favorable expectations have been created experimentally, compared to 
the students of the control group. 

The specific predictions from affect effort theory are that there will be a 
substantial positive correlation (a) between the favorableness of the expectation
induced and the increase in positive affect and teaching effort, and (b) between the
increase in positive affect and teaching effort and the increase in subsequent student
intellectual performance. Any theory of the mediation of interpersonal expectancy
effects must provide evidence for the relationship between (a) expectations and the
mediators and (b) mediators and the behavior of the expectee or target (Rosenthal, 
1981). 



Communication Channels

Affect/effort theory predicts that the factor of teaching effort depends most
heavily on the verbal channel of communication with some contribution from such
nonverbal channels as facial expression, body movement, and tone of voice. The
factor of affect, however, is predicted to depend at least as much on the nonverbal
channels as on the verbal channel of communication. This prediction is based on the
association of cognitive with linguistic functioning and the association of affective
with paralinguistic functioning (Buck, 1984; Ekman, 1973; Blanck, Buck, &
Rosenthal, 1986).

Overall teaching effort can be defined by the mean ratings made by videotape 
raters on such variables as amount of material taught, task orientation, teaching
effort expended, and active, competent, and professional demeanor. These raters
have access to the full videotape, including sound track. Four other groups of
randomly assigned raters have access only to (a) the written transcript of what the
teachers said; (b) the teachers' faces while teaching; (c) the teachers' bodies while
teaching; and (d) the teachers' tones of voice while teaching based on content-filtered
speech (Rosenthal, 1987).



Overall positiveness of affect can be defined by the mean ratings made by
videotape raters on such variables as warm, friendly, likable, pleasant, caring, and
empathic. As in the case of the teaching effort variable, ratings are made by five
groups of randomly assigned raters. One of these groups has access to all video and
audio information but the remaining groups have access only to (a) the transcript of
what the teachers said; (b) the teachers' faces; (c) the teachers' bodies; and (d) the
teachers' tones of voice based on content-filtered speech.
Molar versus Molecular Variables

Affect/effort theory predicts that the factor of teaching effort is associated more
strongly with more molecular variables involving counting or timing than with more
molar, global variables involving overall ratings, while the opposite is true for the
factor of positivity of affect. Thus, for example, we predict that teaching effort can be
relatively more efficiently assessed by more molecular variables such as time on
task, work-related contacts, speech rate, and number of words taught, than by such
variables as ratings of teaching effort expended or activity level. In the case of affect,
the theory predicts that more molar ratings of, e.g., warmth, empathy, or friendliness
will better assess affect than will more molecular variables such as smiling,
glancing, nodding, leaning, pitch level, or pitch range. This is a
counter-psychometric prediction, since molecular variables tend to be far more reliable than
molar variables (Rosenthal, [year illegible]; 1987). Nevertheless, affect/effort theory
predicts that molar variables will correlate more highly with the criterion affect
variable¹ than will the molecular variables. We predict this because interpersonally
communicated affect implicates the use of many channels of verbal and nonverbal
communication and molecular variables tend to be more channel limited than molar
variables. Since the factor of teaching effort depends more heavily on a single
channel, the verbal, it will be better indexed by molecular speech-related variables
than by more molar variables.
Redundancy versus Specificity

Affect/effort theory states that effort is characterized by greater simplicity and
unity and less potential for conflict and ambivalence than is the case for affect.
Therefore, when molar variables are assessed in the verbal, face, body, and tone
channels, effort will show greater channel-to-channel redundancy than will affect,
which will show greater channel specificity. Redundancy is measured either by the
eigenvalue of the first unrotated principal component or, more simply, by the
average intercorrelation among the four channels of communication.
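
The two redundancy indices just named can be sketched in a few lines. This is only an illustration under stated assumptions: the correlation matrices are hypothetical (every pair of channels is given the same correlation), the helper names `equicorrelated`, `mean_intercorrelation`, and `first_eigenvalue` are invented for this sketch, and the eigenvalue is found by simple power iteration rather than by a full principal components routine.

```python
from itertools import combinations


def equicorrelated(k, r):
    """Hypothetical k x k correlation matrix with every off-diagonal entry equal to r."""
    return [[1.0 if i == j else r for j in range(k)] for i in range(k)]


def mean_intercorrelation(corr):
    """Average of the off-diagonal correlations, each pair counted once."""
    pairs = [corr[i][j] for i, j in combinations(range(len(corr)), 2)]
    return sum(pairs) / len(pairs)


def first_eigenvalue(corr, iterations=500):
    """Largest eigenvalue of a correlation matrix, by power iteration.

    This is the eigenvalue of the first unrotated principal component.
    """
    k = len(corr)
    v = [1.0] * k
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        eigenvalue = max(abs(x) for x in w)
        v = [x / eigenvalue for x in w]
    return eigenvalue


# Hypothetical values for the four channels (verbal, face, body, tone).
effort_corr = equicorrelated(4, 0.6)   # assumed high redundancy
affect_corr = equicorrelated(4, 0.2)   # assumed high channel specificity

for label, corr in (("effort", effort_corr), ("affect", affect_corr)):
    print(label, round(mean_intercorrelation(corr), 2), round(first_eigenvalue(corr), 2))
```

When all intercorrelations are equal, the two indices carry the same information, since the first eigenvalue is then 1 + (k - 1) times the average intercorrelation; with unequal correlations they can diverge somewhat.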

¹This variable is defined by the composite variable formed from the principal
components analysis but with unit weighting (Rosenthal, 1987, Chapter 5).



Channel Discrepancy

On the basis of a rich clinical tradition (e.g., Bateson, Jackson, Haley, &
Weakland, 1956), and of more recent empirical work by Bugental's group (e.g.,
Bugental, Love, Kaswan, & April, 1971), by DePaulo & Rosenthal (1979), and
others, there is reason to suspect that teachers showing greater discrepancies
between the channels (e.g., larger differences in positivity expressed between verbal
content and body movements or tone of voice) will differ in the magnitude of
interpersonal expectancy effects shown. Since channel discrepancies are associated
with perceptions of negative affect, teachers showing characteristic discrepancies
may show smaller effects of positive expectations that have been induced
experimentally.

Although we have been speaking of channel discrepant communication as a 
trait-like, stable moderating variable, it should be noted that we can also consider it 
as a state-like, situational, mediating variable. Indeed, we will be examining 
channel discrepant communications as mediating variables with the prediction that 
discrepant communications will function as more affectively negative than would be
predicted from the mean affective level of the two channels involved. 
Interactional Synchrony

Affect/effort theory implies that as a consequence of the increased positivity of
affect and of teaching effort that typically follows an increase in favorable
expectation there will be an improvement in the rapport or micro-climate of the
teacher-student dyad. This increased rapport can be assessed by measures of interactional
synchrony and it will predict the magnitude of improvement of students' intellectual
performance (Bernieri, Reznick, & Rosenthal, 1988; Bernieri & Rosenthal, in press).
Interactional synchrony, then, functions as an additional post affect/effort mediator
occurring before increased student performance.
Direct Intervention

An additional strong test of affect/effort theory is possible by attempting to
achieve direct experimental control of the mediating factors. We can manipulate
experimentally both the affect and the effort factors. Our basic independent
variables will be high versus low levels of positive affect and high versus low levels
of teaching effort in a 2 x 2 design. By training teachers to show all four possible
combinations of affect and effort we can test directly the effects of both factors on
student learning. Although our primary goal would be cross-validation of
affect/effort theory, this research would also serve as part of a useful foundation for
future programs of applied research designed to improve student performance by
using research results from the literature of interpersonal expectation effects.
Future Directions

We plan to extend the generality of affect/effort theory to other domains:
specifically, to the domains of counseling, psychotherapy, medicine, and
management. We believe that affect/effort theory applies as well to these domains as to the
domain of education. The primary conceptual adjustment that must be made is in
the nature of the effort factor. For the educational context the effort is teaching
effort. For the counseling and psychotherapy contexts, the effort is the effort after
understanding. For the medical and management contexts, the effort is
problem-solving effort.
Some Methodological Matters

The methodological portion of my talk is designed in part both to comfort the
afflicted and to afflict the comfortable. The afflicted are those of us who work in the
softer, wilder areas of our field--the areas where the results seem ephemeral and
unreplicable, and where the r's seem always to be approaching zero as a limit.
These softer, wilder areas include those of social, personality, clinical,
developmental, educational, organizational, and health psychology. They also include parts
of psychobiology and cognitive psychology.

My message to those of us toiling in these muddy vineyards will be that we are 
doing better than we might have thought. My message to those of us in any areas in
which we feel we have pretty well nailed things down will be that we haven't, and 
that we could be doing a whole lot better. 
How Large Must an Effect Be To Be Important?

There is a bit of good news-bad news abroad in the land. The good news is that
more sophisticated editors, referees, and researchers are becoming aware that
reporting the results of a significance test is not a sufficiently enlightening
procedure to stand alone. More and more we are beginning to see a report of the
magnitude of the effect accompanying the p level. The bad news is that we are still not
quite sure what to do with such a report of the magnitude of the effect, for example, a
correlation coefficient.

There is one bit of training that all psychologists have undergone. From
undergraduate days onward we have all been taught that there is only one proper, decent
thing to do whenever we see a correlation coefficient--we must square it. For most of
the softer, wilder areas of psychology, squaring the correlation coefficient tends to
make it go away--vanish into nothingness as it were. That is one of the sources of
malaise in the social and behavioral sciences. It is sad and quite unnecessary, as we
shall soon see.

The Physicians' Aspirin Study

At a special meeting held on December 18, 1987, it was decided to end
prematurely a randomized double-blind experiment on the effects of aspirin on
reducing heart attacks (Steering Committee of the Physicians' Health Study
Research Group, 1988). The reason for this unusual termination of such an
experiment was that it had become so clear that aspirin prevented heart attacks (and
deaths from heart attacks) that it would be unethical to continue to give half the
physician research subjects a placebo. Now what do you suppose was the magnitude
of the experimental effect that was so dramatic as to call for the termination of this
research? Was $r^2$ .90 so that the corresponding r would have been .95? No. Well,
was $r^2$ .50, .30, or even .20, so that the corresponding r's would have been .71, .55, or
.45? No. Actually, what $r^2$ was, was .0011, with a corresponding r of .034.

Roughly 1 percent of the physicians taking aspirin compared to 2 percent of the 
physicians taking placebo suffered heart attacks. One way of showing the practical 
importance of even a small r is by means of a Binomial Effect Size Display (BESD). 
In such a display, the correlation is shown to be the simple difference in outcome 
rates between the experimental and the control groups in a standard table which 
always adds up to column totals of 100 and row totals of 100 (Rosenthal & Rubin, 
1982b). 
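
The display can be generated directly from r. The sketch below follows the description above (row and column totals of 100, with the difference between the two success rates equal to 100 times r); the function names `besd` and `print_besd` and the output layout are assumptions made for illustration.

```python
def besd(r):
    """Binomial Effect Size Display for a correlation r.

    Returns (success, failure) percentages for the treatment and control
    rows of a 2 x 2 table whose row and column totals are all 100.
    """
    treatment_success = 50 + 100 * r / 2
    control_success = 50 - 100 * r / 2
    return {
        "treatment": (treatment_success, 100 - treatment_success),
        "control": (control_success, 100 - control_success),
    }


def print_besd(r):
    """Print the display as a small table."""
    table = besd(r)
    print(f"r = {r:.3f}")
    print(f"{'':<10}{'success':>9}{'failure':>9}{'total':>7}")
    for row in ("treatment", "control"):
        success, failure = table[row]
        print(f"{row:<10}{success:>9.1f}{failure:>9.1f}{success + failure:>7.0f}")


print_besd(0.034)  # the aspirin study's r
print_besd(0.32)   # the psychotherapy r cited from Smith & Glass (1977)
```

For r = .32 the display shows success rates of 66 versus 34 per 100; for r = .034 the rates are 51.7 versus 48.3.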

This type of result seen in the physicians' aspirin study is not at all unusual in 
biomedical research. Some years earlier, on October 29, 1981, the National Heart,
Lung, and Blood Institute discontinued its placebo-controlled study of propranolol
because results were so favorable to the treatment that it would be unethical to
continue withholding the life-saving drug from the control patients. And what was
the magnitude of this effect? Once again the effect size r was .04, and the leading
digits of the $r^2$ were .00! As behavioral researchers we are not used to thinking of r's
of .04 as reflecting effect sizes of practical importance. But when we think of an r of
.04 as reflecting a 4% decrease in heart attacks, the interpretation given r in a
Binomial Effect Size Display, the r does not appear to be quite so small, especially if
we can count ourselves among the 4 per 100 who manage to survive.

These results of biomedical studies are not flukes. For example, the correlation 
between alcohol abuse and having served in Vietnam is well-known, but the actual
correlation is .07 (Centers for Disease Control, 1988). The effects of AZT on survival
in treating AIDS are reflected in an r of .23 (Barnes, 1986), and the effects of
cyclosporine in preventing the rejection of an organ transplant are associated with
an r of .19 (Canadian Multicentre Transplant Study Group, 1983). The effects of
psychotherapy associated with an r of .32 are larger than any of these biomedical
relationships (Smith & Glass, 1977). Once we begin to think of the correlation
coefficient as reflecting the difference in outcome rates between the experimental
and control groups we begin to see that we are doing considerably better in our
"softer, wilder" sciences than we may have thought we were doing (Rosenthal &
Rubin, 1982).

So far, our conversation has been intended to comfort the afflicted. In what 
follows the intent is a bit more to afflict the comfortable. We consider, first, the topic 
of replication. 
The Meaning of Successful Replication
There is a long tradition in psychology of our urging one another to replicate
each other's research. But although we have been very good at calling for
replications, we have not been very good at deciding when a replication has been
successful. The issue we now address is: When shall a study be deemed successfully 
replicated? 

Successful replication is ordinarily taken to mean that a null hypothesis that 
has been rejected at time 1 is rejected again, and with the same direction of outcome, 
on the basis of a new study at time 2. We have a failure to replicate when one study 
was significant and the other was not. Let us examine more closely a specific 
example of such a "failure to replicate." 
Pseudo-Failures to Replicate 

The saga of Smith and Jones. Smith has published the results of an experiment 
in which a certain treatment procedure was predicted to improve performance. She 
reported results significant at p<.05 in the predicted direction. Jones publishes a 
rebuttal to Smith claiming a failure to replicate. In situations of that sort it turns 
out often to be the case that, although Smith's results were more significant than 
Jones's, the studies were in quite good agreement as to their estimated sizes of effect 
as defined either by Cohen's d [$(\mathrm{Mean}_1 - \mathrm{Mean}_2)/\sigma$] or by r, the correlation between
group membership and performance score (Cohen, 1977; 1988; Rosenthal, 1984).
Thus, studies labeled as "failures to replicate" often turn out to provide strong 
evidence for the replicability of the claimed effect. 
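
A short numerical sketch makes the point concrete. The t values and degrees of freedom below are hypothetical, chosen only to mimic a significant original study and a much smaller nonsignificant replication; they are not taken from the text. The conversions assume two independent groups of equal size, and the function names are invented for this sketch.

```python
from math import sqrt


def d_from_t(t, df):
    """Cohen's d from an independent-groups t, assuming equal group sizes."""
    return 2 * t / sqrt(df)


def r_from_t(t, df):
    """Effect size r (correlation of group membership with score) from a positive t."""
    return sqrt(t * t / (t * t + df))


# Hypothetical results: Smith's larger study reaches p < .05, Jones's smaller one does not.
studies = {
    "Smith": {"t": 2.21, "df": 78},
    "Jones": {"t": 1.06, "df": 18},
}

for name, result in studies.items():
    d = d_from_t(result["t"], result["df"])
    r = r_from_t(result["t"], result["df"])
    print(f"{name}: t({result['df']}) = {result['t']:.2f}, d = {d:.2f}, r = {r:.2f}")
```

With these illustrative numbers both studies yield a d of about .50 and an r of about .24, although only the larger study is significant at the .05 level.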

On the odds against replicating significant results. A related error often found
in the behavioral and social sciences is the implicit assumption that if an effect is
"real" we should therefore expect it to be found significant again upon replication.
Nothing could be further from the truth.

Suppose there is in nature a real effect with a true magnitude of $d = .50$ (i.e.,
$(\mathrm{Mean}_1 - \mathrm{Mean}_2)/\sigma = .50$ $\sigma$ units), or, equivalently, $r = .24$ (a difference in success
rate of 62% versus 38%). Then suppose an investigator studies this effect with an N
of 64 subjects or so, giving the researcher a level of statistical power of .50, a very
common level of power for behavioral researchers of the last 30 years (Cohen, 1962;
Sedlmeier & Gigerenzer, 1989). Even though a d of .50 or an r of .24 can reflect a
very important effect (as we saw earlier in this paper), there is only one chance in
four that both the original investigator and a replicator will get results significant at
the .05 level. If there were two replications of the original study there would be only
one chance in eight that all three studies would be significant, even though we know
the effect in nature is very real and very important.
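
The arithmetic of this example can be checked with a rough calculation. The sketch assumes two equal groups of 32 (an N of 64), a two-tailed test at the .05 level, and a normal approximation to the power of the t test; the function names are invented for illustration, and exact power from the noncentral t distribution would differ slightly.

```python
from math import sqrt
from statistics import NormalDist


def r_from_d(d):
    """Convert Cohen's d to r, assuming two groups of equal size."""
    return d / sqrt(d * d + 4)


def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-tailed, two-group test of effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    delta = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    return (1 - z.cdf(z_crit - delta)) + z.cdf(-z_crit - delta)


def prob_all_significant(power, k):
    """Probability that k independent studies of equal power are all significant."""
    return power ** k


print(round(r_from_d(0.50), 2))               # r equivalent of d = .50
print(round(approx_power(0.50, 32), 2))       # power with 32 per group
print(prob_all_significant(0.50, 2))          # original plus one replication
print(prob_all_significant(0.50, 3))          # original plus two replications
```

The approximation gives a power close to .50, and with power of exactly .50 the joint probabilities are .25 for two studies and .125 for three, the "one chance in four" and "one chance in eight" of the text.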
Contrasting Views of Replication 

The traditional, not very useful view of replication has two primary 
characteristics: 

(1) It focuses on significance level as the relevant summary statistic of a 
study, and 

(2) It makes its evaluation of whether replication has been successful in a
dichotomous fashion. For example, replications are successful if both or neither
p<.05 (or .01, etc.), and they are unsuccessful if one p<.05 (or .01, etc.) and the other
p>.05 (or .01, etc.). Psychologists' reliance on a dichotomous decision procedure
accompanied by an untenable discontinuity of credibility in results varying in p 
levels has been well documented (Nelson, Rosenthal, & Rosnow, 1986; Rosenthal & 
Gaito, 1963, 1964). 

The newer, more useful views of replication success have two primary 
characteristics: 

1. A focus on effect size as the more important summary statistic of a study 
with only a relatively minor interest in the statistical significance level, and 



2. An evaluation of whether replication has been successful made in a
continuous fashion. For example, two studies are not said to be successful or
unsuccessful replicates of each other, but rather the degree of failure to replicate is specified.
Some Metrics of the Success of Replication

Differences between effect sizes. Once we adopt a view of the success of
replication as a function of similarity of effect sizes obtained, we can become more precise
in our assessments of the success of replication. Replication success could be indexed
by the difference between the effect sizes obtained in the original study and in the
replication. For example, we could employ the differences in Cohen's d's or the effect
size r's obtained, or we could employ Cohen's q, which is the difference between r's
that have been first transformed to Fisher's Z's. Fisher's Z metric is distributed
nearly normally and can thus be used in setting confidence intervals and testing
hypotheses about r's, whereas r's distribution is skewed and the more so as the
population value of r moves further from zero. Cohen's q is especially useful for
testing the significance of difference between two obtained effect size r's (Rosenthal,
1984; Rosenthal & Rubin, 1982a; Snedecor & Cochran, 1980). When there are more
than two effect size r's to be evaluated for their variability (i.e., heterogeneity) we
can simply compute the standard deviation (S) among the r's or their Fisher Z
equivalents. If a test of significance of heterogeneity of these Fisher Z's is desired, a
simple $\chi^2$ test of heterogeneity is readily available (Hedges, 1982; Rosenthal &
Rubin, 1982a).
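
These indices are simple to compute. The sketch below assumes independent studies and uses the usual large-sample variance of Fisher's Z, 1/(N - 3); the function names and the example r's and sample sizes are hypothetical. The chi-square statistic is returned with its degrees of freedom and would be referred to a chi-square table.

```python
from math import atanh, sqrt
from statistics import NormalDist


def fisher_z(r):
    """Fisher's Z transformation of r."""
    return atanh(r)


def cohens_q(r1, r2):
    """Cohen's q: the difference between two Fisher-transformed r's."""
    return fisher_z(r1) - fisher_z(r2)


def z_test_two_rs(r1, n1, r2, n2):
    """Standard normal deviate and two-tailed p for the difference between two independent r's."""
    z = cohens_q(r1, r2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def heterogeneity_chi_square(rs, ns):
    """Chi-square (df = k - 1) for the heterogeneity of k independent r's."""
    zs = [fisher_z(r) for r in rs]
    weights = [n - 3 for n in ns]
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    chi2 = sum(w * (z - mean_z) ** 2 for w, z in zip(weights, zs))
    return chi2, len(rs) - 1


# Hypothetical original study and replication.
print(cohens_q(0.40, 0.20))
print(z_test_two_rs(0.40, 80, 0.20, 20))
print(heterogeneity_chi_square([0.40, 0.20, 0.30], [80, 20, 50]))
```

With only two studies the heterogeneity chi-square equals the square of the Z for the difference between the two r's, so the two procedures agree.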

Meta-analytic metrics. As the number of replications for a given research 
question grows, a full assessment of the success of the replicational effort requires 
the application of meta-analytic procedures. An informative summary of the
meta-analysis might be the stem-and-leaf display of the effect sizes found in the
meta-analysis (Tukey, 1977). A more compact summary of the effect sizes might be
Tukey's (1977) box plot, which gives the highest and lowest obtained effect sizes
along with those found at the 25th, 50th, and 75th percentiles. For single index
values of the consistency of the effect sizes, one could employ (a) the range of effect
sizes found between the 75th ($Q_3$) and 25th ($Q_1$) percentile, (b) some standard
fraction of that range (e.g., half or three-quarters), (c) S, the standard deviation of
the effect sizes, or (d) SE, the standard error of the effect sizes.

As a slightly more complex index of the stability, replicability, or clarity of the
average effect size found in the set of replicates, one could employ the mean effect
size divided either by its standard error ($S/\sqrt{k}$, where k is the total number of
replicates), or simply by S. The latter index of mean effect size divided by its
standard deviation (S) is the reciprocal of the coefficient of variation or a kind of
coefficient of robustness.
What Should Be Reported?
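
The summary indices listed above can be gathered in one small routine. The effect sizes in the example are hypothetical, the function name `summarize_effect_sizes` is invented for this sketch, and the quartiles depend on the interpolation convention used (here the default of Python's `statistics.quantiles`).

```python
from math import sqrt
from statistics import mean, quantiles, stdev


def summarize_effect_sizes(effects):
    """Consistency and robustness indices for a set of replicate effect sizes."""
    k = len(effects)
    q1, median, q3 = quantiles(effects, n=4)
    m = mean(effects)
    s = stdev(effects)   # S, the standard deviation of the effect sizes
    se = s / sqrt(k)     # SE of the mean effect size
    return {
        "k": k,
        "lowest": min(effects),
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "highest": max(effects),
        "Q3 - Q1": q3 - q1,
        "mean": m,
        "S": s,
        "SE": se,
        "mean / SE": m / se,
        "mean / S (robustness)": m / s,
    }


# Hypothetical effect size r's from five replicates.
summary = summarize_effect_sizes([0.10, 0.20, 0.24, 0.30, 0.36])
for label, value in summary.items():
    print(f"{label}: {value:.3f}")
```

The mean divided by SE grows with the number of replicates, whereas the mean divided by S does not, which is why the latter serves as an index of robustness rather than of sheer accumulation of studies.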

Effect sizes and significance tests. If we are to take seriously our newer view of
the meaning of the success of replications, what should be reported by authors of
papers seen to be replications of earlier studies? Clearly, reporting the results of
tests of significance will not be sufficient. The effect size of the replication and of the
original study must be reported. It is not crucial which particular effect size is
employed, but the same effect size should be reported for the replication and the
original study. Complete discussions of various effect sizes and when they are useful
are available from Cohen (1977, 1988) and elsewhere (e.g., Rosenthal, 1984). If the
original study and its replication are reported in different effect size units these can
usually be translated to one another (Cohen, 1977, 1988; Rosenthal, 1984; Rosenthal
& Rosnow, 1984; Rosenthal & Rubin, in press).

Power. Especially if the results of either the original study or its replication 
were not significant, the statistical power at which the test of significance was made 
(assuming, for example, a population effect size equivalent to the effect size actually 
obtained) should be reported (Cohen, 1988). In addition to reporting the statistical
power for each study separately, it would be valuable to report the overall
probability that both studies would have yielded significant results given, for
example, the effect size estimated from the results of the original and the replication 
study combined. 

The equally likely effect size. A marvelous suggestion has been made by Donald 
Rubin that would go a long way toward helping us get over our problem with the
relative risks of type II versus type I errors. Don has suggested that whenever we
conclude that there is "no effect" we report both the effect size and that confidence
interval around the effect size that ranges from the effect size of zero to the equally
likely effect size greater than the one we obtained. For example, suppose a
replicator, Jones, did not reject the null but obtained an effect size of d = .50. If Jones
had been required to report that his d of .50 was just as close to a d of 1.00 as it was to
a d of zero, Jones would have been less likely to draw his wrong conclusion that he
had failed to replicate the work of Smith, who had found a very similar effect size.
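
A minimal sketch of this suggestion follows. It assumes a symmetric sampling distribution for the effect size, so that the interval running from the null value to the equally likely larger value is centered on the obtained effect size and has a confidence level of one minus the two-tailed p. The function name and the p value in the example are hypothetical; the text gives only Jones's d of .50.

```python
def equally_likely_interval(effect_size, two_tailed_p, null_value=0.0):
    """Interval from the null value to the equally likely larger effect size.

    Returns (null_value, equally_likely_value, confidence_level).
    Assumes a symmetric sampling distribution for the effect size.
    """
    equally_likely_value = 2 * effect_size - null_value
    confidence_level = 1 - two_tailed_p
    return null_value, equally_likely_value, confidence_level


# Jones's d of .50 with a hypothetical two-tailed p of .30.
low, high, confidence = equally_likely_interval(0.50, 0.30)
print(f"d = 0.50 lies midway between {low:.2f} and {high:.2f} "
      f"({confidence:.0%} confidence interval)")
```

Under these assumptions, Jones's report would state that a d of 1.00 is exactly as well supported by his data as a d of zero.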

Meta-Analytic Procedures: Some Benefits 

Any discussion of replication and of the evaluation of the success of a particular 
replication cannot avoid a more formal consideration of meta-analytic procedures. 
In the years 1980, 1981, and 1982 alone, well over 300 papers were published
on the topic of meta-analysis (Lamb and Whitla, 1983). Does this represent a giant
stride forward in the development of the behavioral and social sciences or does it
signal a lemming-like flight to disaster? Judging from reactions to past
meta-analytic enterprises, there are at least some who take the more pessimistic view.
Some three dozen scholars were invited to respond to a meta-analysis of studies of
interpersonal expectancy effects conducted by Don Rubin and myself (Rosenthal &
Rubin, 1978). Although much of the commentary dealt with the substantive topic of
interpersonal expectancy effects, a good deal of it dealt with methodological aspects
of meta-analytic procedures and products. Some of the criticisms offered were
accurately anticipated by Glass (1978) who had earlier received commentary on his
meta-analytic work (Glass, 1976) and that of his colleagues (Smith & Glass, 1977;
Glass, McGaw, & Smith, 1981). These criticisms have been detailed and addressed
elsewhere (Rosenthal, 1989). Today, therefore, I want to use the time that remains
to note a number of special benefits of meta-analysis. Some of these benefits are well
known, but some are not--indeed, some are most arcane.
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Most Obvious Benefits 

Completeness. Meta-analytic consideration of a research domain is more
complete and exhaustive, though this does not mean that all studies found are
weighted equally. Indeed, every study should be weighted from zero to any desired
number. These weights, of course, must be defensible. (It will not do to weight all
my results +1.00 and all my enemies' results 0.00.)
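
The weighting idea can be made concrete with a minimal sketch. It assumes the effect sizes are d values and that weights are any non-negative numbers the analyst can defend (for example, sample size or a rated study quality); the function name and the three studies are hypothetical, chosen only for illustration.

```python
def weighted_mean_effect(effects, weights):
    """Return the weighted mean of a list of study effect sizes.

    A weight of zero drops a study; equal weights give the simple mean.
    """
    if len(effects) != len(weights):
        raise ValueError("need exactly one weight per effect size")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * e for w, e in zip(weights, effects)) / total_weight


# Hypothetical d values from three studies (illustration only).
d_values = [0.20, 0.50, 0.80]

unweighted = weighted_mean_effect(d_values, [1, 1, 1])
by_sample_size = weighted_mean_effect(d_values, [40, 100, 60])
by_rated_quality = weighted_mean_effect(d_values, [1, 3, 2])
print(unweighted, by_sample_size, by_rated_quality)
```

Reporting the summary under more than one defensible weighting scheme shows the reader how much the conclusion depends on the weights chosen.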

Explicitness. The quantitative nature of the process of obtaining effect sizes,
standard normal deviates, and weights forces explicitness on the analyst. Vague
terms like "no relationship," "some relationship," "a strong relationship," and "very
significant" are replaced by numerical values.

Power. Empirical work has shown that meta-analytic procedures increase
power and decrease Type II errors (Cooper & Rosenthal, 1980).

Less Obvious Benefits

Moderator variables. These are more easily spotted and evaluated in a context 
of a quantitative research summary. This aids theory development and increases 
empirical richness. 

Cumulation problems. Meta-analytic procedures address, in part, the chronic 
complaint that social sciences cumulate so poorly compared to the physical sciences. 
It should be noted that recent historical and sociological investigations have
suggested that the physical sciences may not be all that much better off than we are
when it comes to successful replication (Collins, 1985; Hedges, 1987; Pool, 1988). For
example, Collins (1985) has described the failures to replicate the construction of 
TEA-lasers despite the availability of detailed instructions for replication. 
Apparently TEA-lasers could be replicated dependably only when the replication 
instructions were accompanied by a scientist who had actually built a laser. 

Least Obvious Benefits

Decrease in overemphasis on single studies. One not so obvious benefit that will
accrue to us is the gradual decrease in the overemphasis on the results of a single
study. There are good sociological grounds for our monomaniacal preoccupation
with the results of a single study. Those grounds have to do with the reward system
of science, where recognition, promotion, reputation, and the like depend on the
results of the single study, also known as the smallest unit of academic currency.
The study is "good," "valuable," and above all, "publishable" when p ≤ .05. Our
disciplines would be further ahead if we adopted a more cumulative view of science in
which the impact of a study were evaluated less on the basis of p levels, and more on
the basis of its own effect size and on the revised effect size and combined probability
that resulted from the addition of the new study to any earlier studies investigating
the same or a similar relationship. This, of course, amounts to a call for a more
meta-analytic view of "doing science."

B. F. Skinner has been eloquent in his comments on the overvaluation of the
single study: "In my own thinking, I try to avoid the kind of fraudulent significance
which comes with grandiose terms or profound 'principles.' But some psychologists
seem to need to feel that every experiment they do demands a sweeping
reorganization of psychology as a whole. It's not worth publishing unless it has some such
significance. But research has its own values, and you don't need to cook up spurious
reasons why it's important." (Skinner, 1983, p. 39).

"The new intimacy." This new intimacy is between the reviewer and the data.
We cannot do a meta-analysis by reading abstracts and discussion sections. We are
forced to look at the numbers and, very often, compute the correct ones ourselves.
Meta-analysis requires us to cumulate data, not conclusions. "Reading" a paper is
quite a different matter when we need to compute an effect size and a fairly precise
significance level, often from a results section that never heard of effect sizes, precise
significance levels (or the APA publication manual)!



The demise of the dichotomous significance testing decision. Far more than is
good for us, social and behavioral scientists operate under a dichotomous null
hypothesis decision procedure in which the evidence is interpreted as anti-null if p ≤
.05 and pro-null if p > .05. If our dissertation p is < .05 it means joy, a Ph.D., and a
tenure-track position at a major university. If our p is > .05 it means ruin, despair, 
and our advisor's suddenly thinking of a new control condition that should be run. 
That attitude really must go. God loves the .06 nearly as much as the .05. Indeed, I 
have it on good authority that she views the strength of evidence for or against the 
null as a fairly continuous function of the magnitude of p. As a matter of fact, two .06 
results are much stronger evidence against the null than one .05 result; and 10 p's of 
.10 are stronger evidence against the null than 5 p's of .05. 
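
The arithmetic behind that last claim can be checked with a short sketch. It assumes the studies are independent, that the p values are one-tailed and all in the same direction, and that they are combined by adding standard normal deviates (the Stouffer method); the text above does not name a combining method, so that choice and the function name are assumptions.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def combine_p_values(p_values):
    """Combine independent one-tailed p values by adding z's (Stouffer method).

    Returns (combined_z, combined_one_tailed_p).
    """
    # Convert each one-tailed p to its standard normal deviate.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    # The sum of k independent z's has variance k, so divide by sqrt(k).
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    combined_p = 1.0 - _STD_NORMAL.cdf(combined_z)
    return combined_z, combined_p


for label, ps in [
    ("one p of .05", [0.05]),
    ("two p's of .06", [0.06, 0.06]),
    ("five p's of .05", [0.05] * 5),
    ("ten p's of .10", [0.10] * 10),
]:
    z, p = combine_p_values(ps)
    print(f"{label}: combined z = {z:.2f}, combined p = {p:.5f}")
```

Under these assumptions the two results of .06 yield a larger combined z than the single .05, and the ten results of .10 yield a larger combined z than the five results of .05.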

The overthrow of the omnibus test. It is common to find specific questions
addressed by F tests with df > 1 in the numerator or by χ² tests with df > 1. For
example, suppose the specific question is whether increased incentive level improves
the productivity of work groups. We employ four levels of incentive so that our
omnibus F test would have 3 df in the numerator or our omnibus χ² would be on at
least 3 df. Common as these tests are, they reflect poorly on our teaching of
data-analytic procedures. The diffuse hypothesis tested by these omnibus tests usually
tells us nothing of importance about our research question. The rule of thumb is
unambiguous: Whenever we have tested a fixed effect with df > 1 for χ² or for the
numerator of F, we have tested a question in which we are almost surely not
interested.
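
The incentive example can be sketched numerically. The four group means, the group size, and the within-group mean square below are hypothetical, and the sketch assumes equal n per group and a linear prediction expressed by contrast weights of -3, -1, +1, +3; none of these figures comes from an actual study.

```python
def omnibus_f(means, n_per_group, ms_within):
    """Omnibus F (k - 1 df in the numerator) for k equal-n group means."""
    k = len(means)
    grand_mean = sum(means) / k
    ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in means)
    return (ss_between / (k - 1)) / ms_within


def contrast_f(means, weights, n_per_group, ms_within):
    """Focused F (1 df in the numerator) for one planned contrast."""
    if abs(sum(weights)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    contrast_value = sum(w * m for w, m in zip(weights, means))
    ss_contrast = n_per_group * contrast_value ** 2 / sum(w * w for w in weights)
    return ss_contrast / ms_within


# Hypothetical productivity means at four increasing incentive levels.
means = [10, 12, 14, 16]
n_per_group = 10   # assumed equal group sizes
ms_within = 40     # assumed error term, on 4 * (10 - 1) = 36 df

print("omnibus F(3, 36):", omnibus_f(means, n_per_group, ms_within))
print("linear contrast F(1, 36):",
      contrast_f(means, [-3, -1, 1, 3], n_per_group, ms_within))
```

With these made-up numbers the omnibus F is about 1.67 and the contrast F is 5.0. Standard tables put the .05 critical values near 2.9 for F(3, 36) and near 4.1 for F(1, 36), so the diffuse test misses a perfectly linear trend that the focused test detects.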

The situation is even worse when there are several dependent variables as well 
as multiple df for the independent variable. The paradigm case here is canonical 
correlation and special cases are MANOVA, MANCOVA, multiple discriminant
function, multiple path analysis, and complex multiple partial correlation. While all
of these procedures have useful exploratory data-analytic applications, they are
commonly used to test null hypotheses which are scientifically almost always of 
doubtful value. The effect size estimates they yield (e.g., the canonical correlation) 
are also almost always of doubtful value. 

This is not the place to go into detail, but one approach to the problem of 
analyzing canonical data structures is to reduce the set of dependent variables to
some smaller number of composite variables using the principal-components- 
followed-by-unit-weighting approach. Each composite can then be analyzed serially. 

Meta-analytic questions are basically contrast questions. F tests with df > 1 in
the numerator or χ²s with df > 1 are useless in meta-analytic work. That leads to
an additional scientific benefit:

The increased recognition of contrast analysis. Meta-analytic questions require
precise formulation of questions, and contrasts are procedures for obtaining answers
to such questions, often in an analysis of variance or table analysis context.
Although most textbooks of statistics describe the logic and the machinery of 
contrast analyses, one still sees contrasts employed all too rarely. That is a real pity 
given the precision of thought and theory they encourage and (especially relevant to 
these times of publication pressure) given the boost in power conferred with the 
resulting increase in .05 asterisks (Rosenthal & Rosnow, 1985).

A probable increase in the accurate understanding of interaction effects. 
Probably the universally most misinterpreted empirical results in psychology are 
the results of interaction effects. A recent survey of 191 research articles involving 
interactions found only two articles that showed the authors interpreting
interactions in an unequivocally correct manner (i.e., by examining the residuals that
define the interaction) (Rosnow & Rosenthal, 1989). The rest of the articles simply
compared means of conditions with other means, a procedure that does not
investigate interaction effects but rather the sum of main effects and interaction
effects.

Most standard textbooks of statistics for psychologists provide accurate 
mathematical definitions of interaction effects but then interpret not the residuals 
that define those interactions but the means of cells that are the sums of all main 
effects and all interactions. 
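
The distinction between cell means and interaction residuals can be illustrated with a small sketch. It assumes a two-way table of cell means with equal n per cell, so that unweighted row and column means are appropriate; the 2 × 2 table and the function name are hypothetical.

```python
def interaction_residuals(cell_means):
    """Return the residuals left after removing the grand mean,
    row effects, and column effects from a two-way table of cell means."""
    n_rows = len(cell_means)
    n_cols = len(cell_means[0])
    grand_mean = sum(sum(row) for row in cell_means) / (n_rows * n_cols)
    row_effects = [sum(row) / n_cols - grand_mean for row in cell_means]
    col_effects = [
        sum(cell_means[i][j] for i in range(n_rows)) / n_rows - grand_mean
        for j in range(n_cols)
    ]
    return [
        [
            cell_means[i][j] - grand_mean - row_effects[i] - col_effects[j]
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


# Hypothetical cell means: rows are two groups, columns are two conditions.
cell_means = [
    [10, 2],
    [4, 4],
]
for row in interaction_residuals(cell_means):
    print(row)
```

Read as cell means, this table seems to say that the condition matters only for the first group. The residuals, +2 and -2 in the first row and -2 and +2 in the second, show that the interaction itself is a crossover (X-shaped) pattern; the apparent "no effect" in the second row is the interaction cancelling the column main effect.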

In addition, users of SPSS, SAS, BMDP, and virtually all other data-analytic
software are poorly served in the matter of interactions since virtually no programs 
provide convenient tabular output giving the residuals defining interaction. The 
only exception to that of which I am aware is a little-known package called Data- 
Text developed by Arthur Couch and David Armor for which William Cochran and 
Donald Rubin provided the statistical consultation. 

Since many meta-analytic questions are by nature questions of interaction (for 
example, that opposite sex dyads will conduct standard transactions more slowly 
than will same sex dyads), we can be hopeful that increased use of meta-analytic 
procedures will bring with it increased sophistication about the meaning of 
interaction. 

Meta-analytic procedures are applicable beyond meta-analyses. Many of the
techniques of contrast analyses among effect sizes, for example, can be used within a
single study (Rosenthal & Rosnow, 1985). Computing a single effect size from
correlated dependent variables, or comparing treatment effects on two or more 
dependent variables serve as illustrations (Rosenthal & Rubin, 1986). 

The decrease in the splendid detachment of the full professor. Meta-analytic 
work requires careful reading of research and moderate data-analytic skills. We
cannot send an undergraduate research assistant to the library with a stack of 5 x 3
cards to bring us back "the results." With narrative reviews that seems often to have 
been done. With meta-analysis the reviewer must get involved with the actual data 
and that is all to the good. 

Conclusion 

I hope that the methodological section of this paper has provided some comfort 
to the afflicted in showing that many of the findings of our discipline are neither as 
small nor as unimportant from a practical point of view as we may have feared. 
Perhaps I hope, too, that there may have been some affliction of the comfortable in
showing that in our views of replication and of the cumulation of the wisdom of our 
field there is much yet remaining to be done. 

Appendix

I. The Problem 

Oh, F is large and p is small
That's why we are walking tall.

What it means we need not mull
Just so we reject the null.

Or Chi-Square large and p near nil
Results like that, they fill the bill. 

What if meaning requires a poll? 
Never mind, we're on a roll! 

The message we have learned too well? 
Significance! That rings the bell! 

II. The Implications

The moral of our little tale? 
That we mortals may be frail 
When we feel a p near zero
Makes us out to be a hero. 

But tell us then is it too late? 
Can we perhaps avoid our fate? 
Replace that wish to null-reject 
Report the size of the effect. 

That may not insure our glory 
But at least it tells a story 
That is just the kind of yield 
Needed to advance our field. 


